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Abstract: 

Longitudinal data tracking repeated measurements on individuals are 
highly valued for research because they oflfer controls for unmeasured indi- 
vidual heterogeneity that might otherwise bias results. Random effects or 
mixed models approaches, which treat individual heterogeneity as part of 
the model error term and use generalized least squares to estimate model 
parameters, are often criticized because correlation between unobserved 
individual effects and other model variables can lead to biased and incon- 
sistent parameter estimates. Starting with an examination of the relation- 
ship between random effects and fixed effects estimators in the standard 
unobserved effects model, this article demonstrates through analysis and 
simulation that the mixed model approach has a "bias compression" prop- 
erty under a general model for individual heterogeneity that can mitigate 
bias due to uncontrolled differences among individuals. The general model 
is motivated by the complexities of longitudinal student achievement mea- 
sures, but the results have broad applicability to longitudinal modeling. 
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1. Introduction 

Longitudinal data are highly valued in all areas of social science research. 
In education research in particular, the rapidly growing availability of data 
tracking student achievement over time has made longitudinal data analysis 
increasingly prominent. Longitudinal data analysis is now common practice in 
research on identifying effective teaching practices, measuring the impacts of 
teacher credentialing and trainin g, a nd evaluating other educational interven- 
tions 11; [13; H [la S [ii; [il 111). In fact, the United States Department 
of Education funded a center dedicated to longitudinal analyses in education 
("http : //www . caldercenter . org/). Recent computational advances and em- 
pirical findings about the impacts of individual teachers have also intensified 
interest in "value-added" methods (VAM), where trajectories of students' test 
scores as they progress through schooling are used to estimate the contributions 
of individual teachers or schools to student achievement ((^;Ji.; .23; 24; 27; 34.: |4]1) . 
With teacher and school accountability at the forefront of education policy, and 
with educators and researchers seeking more sophisticated ways of putting test 
score data to good use, longitudinal methods are likely to remain critical to 
education research. 

One of the most important attributes of longitudinal data for research is 
that they can lead to less biased and more precise estimates of the effects of 
substantive variables on individual outcomes than is generally possible with 
purely cross-sectional observational data. For example, in education research, 
analysts have consistently found that observable background characteristics of 
students typically available in administrative databases, such as race/ethnicity, 
socio-economic status indicators, and limited English proficiency status, are 
correlated with student achievement, but that there remains substantial unex- 
plained heterogeneity among students in achievement profiles after accounting 
for these observable characteristics (|20l ). In observational studies with cross- 
sectional data, this unmeasured heterogeneity threatens to bias estimates of the 
effects of educational variables being studied (e.g. teacher characteristics such 
as certification and years experience) because of non-random allocation of stu- 
dents to educational settings such as schools, classrooms, or programs. However, 
the repeated measures on individuals inherent to longitudinal data provide op- 
portunities to control for this unmeasured heterogeneity, thereby improving the 
quality of the estimates. 
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Statisticians and economists often take different approaches to longitudinal 
data analyses. Common approaches in the statistical literature focus on model- 
ing growth curves and modeling unobserved heterogeneity as part of the error 
structure for the data using random effects or mixed models (0; EH; [13; El). 
Economists have tended to focus more the potential biasing effects of unob- 
served heterogeneity and fixed effects approaches to remove that bias under 
the appropriate assumptions. Fixed effects approaches introduce parameters for 
each individual as part of the model mean structure, rather than the error struc- 
ture SB; a. The divide between fixed and random effects approaches is 
particularly strong in education research, with numerous examples of economists 
using fixed effects (14; 15; 2^; 25; H; 33; 4^; H), and educational statisticians 
using random effects or mixed model approaches by way of hierarchical linear 
models and related methods (@; [H; 0; |29: 32; sj; [si; [II H Ell; [HI). 

The usual criticism of the random effects or mixed model approach is that 
when treating individual heterogeneity as part of the model error term, correla- 
tion between the unobserved individual effects and other substantive variables 
in the model can lead to biased and inconsistent estimates of the effects of those 
variables, while fixed effects approaches do not suffer these same shortcomings 
(0; [il; 0; El; [H; [HO). However, it is also the case that under the standard 
model of time- invariant individual effects, when many measures are available 
for each individual, and/or when individual heterogeneity accounts for large 
fraction of the observed variance in the measurements, the magnitude of the 
bias from the mixed model approach can be small (ISj; [22; l50( ). The goal of 
this article is to demonstrate that the bias from the mixed model approach can 
also be small under a model that generalizes the standard unobserved effects 
model by allowing for multiple time-invariant individual parameters that are 
related in time-varying ways to the observed measurements. Such a model is 
motivated by the complexities of longitudinal student achievement data but has 
broad applicability to other outcome variables in other research areas. Obtain- 
ing consistent estimates of parameters under this model would generally require 
more sophisticated methods and assumptions. Thus, the practical conclusion 
from this article is that there are circumstances in longitudinal analyses where 
the mixed model approach can provide reasonable conclusions that are robust 
across potentially complex structural models for individual heterogeneity, even 
if that heterogeneity is correlated with the substantive variables being studied. 

We begin by defining the standard model with time-invariant individual pa- 
rameters in Section [2l We compare the fixed and random effects estimators 
under this model, and briefly review results about the conditions under which 
the two approaches will lead to similar estimates. In Section [3] we then general- 
ize the standard unobserved effects model to a general mixed model and show 
how, under mild assumptions, the mixed model approach has a certain "bias 
compression" property that can mitigate bias due to uncontrolled differences 
among individuals. We provide simulation examples in Section[4l representative 
of common empirical scenarios, that demonstrate the implications of our an- 
alytical results. We conclude with a discussion of practical considerations and 
suggestions for future research in Section O 
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2. Fixed and Random Effects Estimators in the Standard Case 

In this section we consider fixed and random effects estimators under the 
standard structural model of time-invariant individual effects. We briefly re- 
view results about the relationship between the estimators and its dependence 
on the number of observations for each individual and the strength of their 
correlation within individuals. We then provide an alternative view of the rela- 
tionship between the estimators that facilitates comparison to the more complex 
case considered later in the article. Consistent with our theme of applications to 
student achievement, we refer to the individuals as students and the measure- 
ments as achievement test scores, but all results hold more generally. 

2.1. Model Specification 

We assume that T achievement measures are tracked for each of n students. 
The students may represent multiple cohorts but for simplicity of the descrip- 
tion we act if is the students belong to a single cohort. The model in this section 
is most applicable when these achievement measures are taken on a single sub- 
ject (e.g. reading or mathematics) at different time points, and thus we refer to 
measurements t = 1, . . . , T in terms of time. Often t = 1, . . . ,T corresponds to 
grades but this is not required; for example, some schools and districts admin- 
ister multiple assessments of the same academic subject during each grade. We 
let Yi = (Fii, . . . ,Yit) denote the vector of achievement scores for student i 
and posit the model 

Y, = Z,9 + 16, + e, (1) 

The design matrix Zi is {T x k) and has an associated (fc x 1) parameter 9 
which is unknown and is the objective of inference. Note that in general, k may 
be a function of T because of the addition of covariatcs (e.g. timepoint means) 
as T grows. Each student has a specific effect Si that applies to all scores via the 
(T X 1) vector 1. The treatment of this effect as either fixed or random in the 
process of estimating 6 is the primary consideration of this section. We make 
no assumption about the relationship of 5; to the other covariates in the model, 
leaving open the possibility that E(6i\Zi) ^ 0. The residual error term is 
assumed to be N{0,a'^I), is assumed to be independent of Si, and is assumed 
to satisfy E{ei\Zi) = (throughout the article we use I to denote identity 
matrices of conforming size). 

The design matrix Zi is general. In educational applications, Zi might typ- 
ically include time marginal means, time-invariant characteristics of individual 

^We use "achievement score" generally, and wc allow the possibility that the scores are 
actually annual gain scores (i.e. Yi^t — ^i.t— i) which are commonly used directly as outcome 
variables in the education research literature. The algebra applied to the model to produce 
the estimators is equally valid if the Yit were to be annual gain scores rather than level scores, 
but the assumptions of the model might be more appropriate for level scores than for gain 
scores. 
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teachers, time-varying teacher-level or classroom-level predictors, time-varying 
or time-invariant (if students switch schools) school factors, or time-varying 
student characteristics. Importantly, we assume that Zi does not include time- 
invariant student characteristics. In that case, the parameters for those charac- 
teristics are identified only if the individual student effects are treated as random 
effects. If such coefficients are of primary interest, then one is forced to use ran- 
dom effects 1^; [13; IHoh and the comparison of the two estimators is meaningless. 
We arc more interested in considering cases where both approaches are possible. 

The expanded version of the model in Equation [1] for the stacked vector of 
student measurements Y (of length nT) across both students and time is 

Y = Ze + D5 + e (2) 

where Z is [nT x k) obtained by stacking the Zi, £> is a (nT x n) matrix of 
indicators or dummy variables linking records to the student effects given by 
the n-vector 5, and e is a nT-vector with distribution A^(0,(t^/), independent 
of (5, and satisfies E{e\Z) = 0. 



2.2. The Fixed Effects Estimator for 

The fixed effects estimator is obtained by ordinary least squares, treating each 
Si as a model parameter. These individual student effects are nuisance. The pa- 
rameters of interest are 9 and the estimator of these parameters in the presence 
of the nuisance parameters can be obtained through a two-stage regression. In 
the first stage, we regress the student fixed effects (D) out of both Y and each 
of the columns of Z. 9p is then obtained by regression on the resulting sets 
of residualsll Letting Hd = D{D'D)^D' be the "hat" or projection matrix 
from the first-stage regression on the student fixed effects, the residuals from 
the regression of Y on £) are given by (/ — Ho)Y , and similarly the residuals 
from the regression oi Z on D are given by (J — Ho)Z. Using the fact that 
(7— Hjj) is symmetric and idempotent, the estimator Op from the second stage 
regression is 

Op = {Z'{I - Hd)Z)-^ Z'(I - Hd)Y (3) 

As will be shown later this form is also convenient for comparing the fixed effects 
and mixed model estimators. 

The transformation of Z and Y hy {I — Hp,) subtracts from the elements 
of each vector the within-student averages of those elements. For example, the 
elements of (J — Ho)Y are {Yn — 1^), where Yi is the average of the scores 
for the ith student. This differencing effectively removes the unobserved hetero- 
geneity, so that under minimal assumptions about Z the fixed effects estimator 
is consistent for 6 as the number of students gets large (3 13; 50[ ). 



^We are implicitly assuming that the matrix [Z\D\ is full column rank, which would not 
be the case if for example Z included indicator columns for all teachers. In that case, columns 
for some teachers would need to be excluded from Z and the appropriate changes would need 
to bo made to the random effects model to maintain comparability. 
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2. 3. The Random Ejfects Estimator for 6 

The random effects estimator treats the student effects 5i as random effects 
with variance . Conditional on known values of and the residual variance 
cr^ and letting R be the covariance matrix of the terms DS + e, is estimated 
by generalized least squares (GLS) via 

e„ = [Z'R-^Zy^ Z'R-^Y (4) 

Motivation for the GLS estimator and the standard properties of the estimator 
are based on the assumption that the 5i are independent of the other variables 
in the model. However, the estimator can be used even if that assumption is 
violated and the resulting estimator can still have desirable properties, which 
is one of the main themes of the article. It is also important to note that the 
estimator in Equation [5] is commonly termed the "infeasible GLS estimator" 
because it assumes known values of the variance components, which are generally 
not available. A so-called "feasible GLS estimator" includes a step that uses the 
data to obtain a consistent estimate JR of i2, and then uses R in Equation [4] 
We revisit this important distinction in both the Simulation Examples and 
Discussion Sections. 



Based on evaluation of i? ^ , it can be shown that (jl3l : |22| : [5 



eR = {Z'{I--iTHD)Z) ^Z'{I-jTHd)Y (5) 

where 

^=i + p(r-i) 

and p is student-level intra-class correlation jiy^ -I- cr^). 

Equations [5] and [3] show that for this simple model, the fixed effects and 
mixed models estimators are highly similar, deviating only by the term 7T. 
Wooldridge (fsoh refers to the transformation by (/ — ^THd) as "quasi-time 
demeaning" because it is equivalent to subtracting a fraction 7T of the within- 
student averages of the components. Moreover, as 7T approaches 1 the fixed 



effects and mixed model estimators will tend to be very similar ()13l : l22l : I5C 

Referring to the definition of 7 in Equation [6l as p 1 for fixed T, 7T 1, 
and as T 00 for fixed p > 0, 7T 1. When j) ^ 1 for fixed T, the continuity of 
matrix operations implies that we can write {6f — 6r) ~ QY where the elements 
of Q are o(l), so that (Op — Or) ^ 0. The analogous result for fixed p asT —f oo] 
namely, that {Op — Sr) ^ as T ^ 00 for fixed p > 0, also holds under some 



additional assumptions on the design matrix Z ()28l ). In practical terms, these 
results imply that if either many scores are available for each student, or if p 
is large, « 1 and the estimators should be similar across a broad range of 
Z. For example, when T = 5 and p in the range of 0.7 to 0.8, values typical of 
actual longitudinal achievement data series, 7T is in the range of 0.92 to 0.95. 
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Under the model in Equation [U the expected value of the mixed model esti- 
mator from Equation [5] conditional on Z is (given 7) is 

E{eR\Z) = + (1 - 7T) [Z' {I - -fTHo) Zy' Z'DE{8\Z) (7) 

which follows from that the fact that HjjD = D. The second term on the 
right hand side is not in general zero when E{d\Z) ^ 0, and thus the random 
effects estimator does not guarantee the elimination of the selection bias. Even 
as n ^ 00 the second term on the RHS does not in general tend to zero and thus 
the random effects estimator is not consistent, the standard result leading many 
practitioners to prefer fixed effects. However, the leading coefficient of (1 — 7T) 
on the bias term in Equation [7] indicates that the random effects estimator 
"compresses" the bias toward zero, with the degree of compression increasing 
as 7T — > 1. This provides an alternative view to quasi-time demeaning of the 
mechanism by which random effects estimators can mitigate bias. Importantly, 
this bias compression is a feature of the random effects estimator that holds 
under more general structural models considered next. 

3. Extensions to More Complex Models 

In this section we consider a generalization of the structural model in Equa- 
tion [1] that allows for multiple time-invariant individual effects that can be 
related to the measurements in ways that vary across time. Such a model is 
particularly relevant for standardized test score data, but is likely applicable to 
many kinds of outcomes used in social science research. We provide a theorem 
about the conditions under which the bias compression property of the random 
effects estimator previously demonstrated for the standard model carries over to 
mixed effects estimation of the more complex structural model. We also present 
some illustrative examples. 



3.1. Model Specification 

The model in Equation [T] assumes that the individual effect Si is related to 
each measurement in exactly the same way. This assumption is the key to the 
ability of the fixed effects estimator to remove bias due to individual heterogene- 
ity, because it implies that within-individual differences of measurements do not 
depend on Si. While this assumption may provide adequate approximations in 
many circumstances, it is unlikely to be exactly met with outcomes commonly 
encountered in education and other research areas. 

For example, the complexities of creating and scaling standardized achieve- 
ment assessments, including multidimensionality of measured constructs, con- 
tent shift over time, and vertical equating procedures (U; 17; 2^; 31; 43). may 



make the constant additive effect of Equation [T] too rigid to adequately capture 
the relationships among multiple scores taken from the same student over time, 
even if those tests are from a single test developer and are intended to pro- 
vide measurements on a common scale. For instance, mathematics tests might 
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contain some items related to algebraic concepts and other items related to 
arithmetic computations, and the proportion of items from each domain might 
change as students mature from elementary school to secondary school, thus 
resulting in achievement test scores that differentially combine student achieve- 
ment on two constructs. Even circumstances as relatively simple as differential 
reliability of assessments across time may be sufficient to invalidate the struc- 
tural model in Equation [TJ And with criterion-referenced tests (tests designed 
to measure student performance relative to absolute standards rather than rel- 
ative to other students) that are not vertically equated becoming more common 
in response to the requirements of the Federal No Child Left Behind Act, it 
is likely that many longitudinal achievement data series cannot be assumed to 
provide measures of a single, consistently scaled construct over time. 

We thus consider the following generalization to the model for the vector of 
scores for one student: 

Y, = Z,e + Ai8,+e, (8) 

The assumptions about the design matrix Zi arc unchanged from the standard 
case in Equation [1] But we generalize the model for student heterogeneity by 
replacing the scalar 5i specific to student i with a d-dimensional vector of factors 
6i, and we allow those factors to be represented in the tests as arbitrary linear 
combinations that can vary over time via the {T x d) matrix We assume 
that the factors are mean zero, normally distributed with V{5i) = Si, a, {d x d) 
positive definite matrix. The arbitrary covariancc structure for the factors allows 
them to cover such cases as different aspects of mathematics ability that might 
be positively correlated (e.g. problem solving and computation abilities). 

By allowing the Si to have multiple components. Equation [8] can account 
for multi-dimensionality of tests and changing weights on these constructs over 
measurements. For example, suppose a test measures d constructs so that scores 
depend on d factors. We let the vector Si denote the time-invariant values on 
these factors for student i. Row t of Ai contains the weights for these factors 
for the tth measurement. As the measures change, the values in the rows of 
Ai change to allow differential weighting of the factors. Examples [l] [2l and [4] 
below provide specific examples of this type of scenario for test scores. Random 
polynomial growth models, considered in Example [3l are also a special case of 
Equation [51 In this case, 6 contains the random cocfhcicnts of the polynomial 
growth model and the columns of Ai arc the polynomials of time. It is important 
to note that the random growth model is a special case because the values of Ai 
are assumed to be known, which provided that T is sufficiently large relative to 
d, allows both Si and thus the individual components of Si to be identified. In 
general, it will not be possible to separately identify Ai and Si , but that is not 

^ Block diagonal matrices appear several times in this remainder of this article. When 
the blocks arc equal across individuals we use a bold-face English letter to denote the block 
diagonal matrix and the same letter with the subscript "1" to denote an individual block. When 
the blocks differ by student we use the letter with the subscript i to denote the individual 
blocks. 
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problematic because as discussed below the mixed model estimator for does 
not require this identification. 

Another important class of models covered by Equation [8] are those that 
jointly model measurements from different tested subjects such as mathematics 
and reading. The multi-factor formulation is perfectly suited to joint longitudinal 
modeling of outcomes from different tested subjects by treating the models 



for different subjects as a set of seemingly unrelated regressions (SUR) (|5lf ) 
where the factors can represent different ability attributes relevant for different 
subjects. In this case it is important to keep in mind that our subscript t does 
not strictly represent time, but rather more generally indexes repeated measures 
on students that may be taken both across time and across subjects within time. 

The factors are assumed to be independent across students, but as before we 
allow the possibility that E{5i\Z) ^ 0. We further assume that rank{Ai) — d 
for all T , which essentially means that we have chosen a parameterization of the 
factors di with minimal dimension. If rank[Ai) = r < d, it can be shown that 
it is possible to reduce the factor to one of dimension r and recover the same 
marginal covariance structure of the student heterogeneity terms, so without loss 
of generality we assume that we are dealing with this maximally parsimonious 
representation at the outset. 

The residual error term is assumed to be mean zero, to be independent of 5i 
for each i, to be independent across students, and to satisfy E{ei\Z) = for each 
i. We let V{ei) ~ 'S'l, a positive definite matrix with diagonal elements bounded 
away from oo. In many practical cases it may be reasonable to assume that 
is diagonal, but this restriction is not required. For example, may have an 
autoregressive structure. These assumptions imply that Ri := V{Ai8i + 6,;) = 
AiSiA'i + ^i, the usual form considered in factor analysis when Si = I flUH)- 
This structural model recovers the standard model considered in Section [2] by 
taking d = 1, Si = , A\ = 1 and 'i'l = a^I. However, by allowing the 
possibility of multiple student-specific factors that link differentially to the test 
scores, and by allowing different residual variances across time points, this model 
is capable of expressing arbitrary covariance structures of the student-specific 
portion of the model. As noted, the generality of this model was motivated 
by considerations of longitudinal standardized test score data, but is relevant 
beyond this specific application. 

The expanded version of the model for all student measurements 

r = Z0 + A<5 + e (9) 

is analogous to the one presented in Equation [2j where again Z is (nT x fc) 
obtained by stacking the Zi, A = In® A\ is (nT x nd) , d is length nd obtained 
by stacking the student factor vectors Si, and e is the vector of nT residual 
errors with covariance matrix J„ ® 'S'l. The covariance matrix of the student 
portion of the model, {AS + e) is thus i? = /„ i2i = J„ (g) {AiSiA[ + 
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3.2. Bias Compression Under the General Model 

The mixed model estimator under Equation [9l conditional on the value of 
ii, is identical to the random effects estimator of Equation IH except that R 
will have more complex structure in the mixed model than it did in the simple 
random effects model. The mixed model estimator is also the GLS estimator 
assuming the residual covariance matrix R. In some circumstances R may be 
parameterized and estimated via random effects or random coefficients, for ex- 



ample, in hierarchical linear models (|37l ) , but this is not necessary. For example. 



in value-added modeling of individual teacher effects, several prominent models 



allow R to be unstructured (l41|; 1331) . Later we discuss practical considerations 
about when it is feasible to impose no structure on R and when parameter- 
reducing assumptions might be required. 

Under the structural model in Equation [9l the expected value of the mixed 
model estimator (assuming known R) is 

E{eR\Z) = 8+ {Z'R-^Zy^ Z'R-^AE{S\Z) (10) 

which should be compared to Equation [7] with the matrix R^A generalizing 
the matrix (1 — ^T)D. Assuming that the diagonals of are bounded away 
from zero, R^ is positive definite which implies that R^A ~ only in the 
degenerate case of A = 0. Thus, in general the estimator is neither unbiased 
nor consistent as n ^ oo when E{8\Z) ^ 0. However, the elements of R^A 
can be very close to 0. In Appendix A we prove the following theorem, which 
implies that in many practically relevant cases, the elements of R^A approach 
zero as T oo: 

Theorem. Let Ai and be defined as above. Then sufficient conditions for 
the elements of R^^A to go to zero uniformly as T —> oo are: 

1. The smallest eigenvalue of A'-^^^ Ai goes to infinity as T oo; and 

2. There exists a number C independent of T such that the elements an of 
^ I the symmetric square root of ^i^, satisfy X)t=i \^it\ < C for all i 

These conditions, while abstract, are met in many practical cases. For example, 
the standard model in Equation[l]has d = 1, Ai = vl with > 0, and ^ = a^I. 
Thus is the scalar {v'^/a'^)T oo as T — > oo, and the rowsums of 

are identically equal to l/cr. Thus the conditions of the theorem are met, 
reiterating the results presented in Section [2] that indicated that the random 
effects estimator converged to the bias-removing fixed-effects estimator as T 
grew. We provide a few additional examples in which the conditions are met in 
the simplified case where ^ — a^I with a > 0, which is representative of more 
general cases of * = diag{al, . . . ,a^) with < af^^^^ < < ^Ipper < oo) 
(both of which satisfy condition 2 of the Theorem). Cases with more general 
'I'l are generally analytically intractable. 

Example 1. Suppose we generalize the standard individual effects model in 
Equation [T] to allow the individual effect to be weighted differently for each 
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measurement. In this case, d = 1 and Ai is the vector (ai, . . . , ot)'- Then 
A'l'if^^^Ai = (l/cr^) 5I]t=i <^t 1 and so condition 1 of the theorem is met as long 
as the series X]t=i divergent. A sufficient condition for this divergence is 
at remain bounded away from zero as T ^ oo; that is, as long as each observed 
measurement provides an amount of information about S that is not diminishing 
to zero. 

Example 2. Suppose d = 2 and row t + 1 of A\ is ^^(T — 1 — for 
t = 0,...,T — 1. That is, the sequence of measurements gradually changes 
from all weight on the first factor to all weight on the second factor. Then 
A[^];^Ai = ^3(2^3yjtC where cn = C22 = + o{T'^) and C21 = C12 = 
^T^ + o(r^). The eigenvalues of C can be shown to be Cn ± C21, with the 
smaller eigenvalue thus being ^T^ + o{T^). Multiplying by gives that 

the smallest eigenvalue of A'j^'if^^Ai behaves like — > cx) as T 00. 

Example 3 (Random linear growth model). Suppose d = 2 and suppose that 
the columns of Ai express a random growth model parameterized such that the 
first column is (1,1,..., 1)', and the second is (1,2,..., T)'. Then A['^^^Ai is 
(T/(T^) times a matrix for which the smallest eigenvalue is bounded away from 
zero as T ^ 00, and thus A'^^^^^Ai satisfies the conditions of the theorem. 

Example 4. Suppose any d > I and suppose that for each measurement the 
rows of A are a random draw from Dirichlet distribution with parameter uj, 
a d X 1 vector such that for j = l...d ojj > and ujq ~ X]j=i'^j- Then 
^A'l'if^^ Ai 4j (r2 + cuJo;'), where H is a diagonal matrix with elements 
and c = > 0. Theorem 3 on p. 116 of Bellman (0) states 

that if Bi and B2 are symmetric with B2 is positive semi-definite, then the 
smallest eigenvalue of Bi + B2 is greater than or equal to the smallest eigenvalue 
of Bi. Letting Bi ~ fl and B2 = cljiv' gives that the smallest eigenvalue of 
^ (rj + cuju)') is greater than or equal to the smallest diagonal element of -i^Q 
which is greater than zero. Hence, the eigenvalues of A'^'if^^Ai will diverge. 

In practical terms, the theorem suggests that when many test scores are 
available for individual students, the mixed model estimator may be effective at 
mitigating bias, even if the individual heterogeneity is related to the predictors 
in the model and even if the covariance structure of the heterogeneity is more 
complex than the simple time-invariant constant offset model of Equation [TJ 
However, generally characterizing the circumstances under which the elements 
of R~^A going to zero implies that the mixed model estimator will compress 
bias as r — !■ 00 is complex because it depends on the structure of .Z. A particular 
challenge is that in most circumstances, growing T implies a growing number 
of predictors in the model (i.e. columns of Z), and in many cases those added 
predictors apply to only a single time point (e.g time point means or interactions 
of other variables with time). 

One sufficient condition to ensure bias compression as T — > 00 is that the 
sums of the absolute values of the rows of (Z'i?^^ Z) Z' arc uniformly bounded 
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by a constant (not depending on T) as T ^ oo. This is not a particularly in- 
tuitive condition, and the asymptotic result might not be well approximated 
for realistic values of T . Hence, we now consider several simulation examples 
building on the examples above to understand when the condition seems likely 
to hold and when it does not. We consider the behavior of the mixed model 
estimator to demonstrate the importance of the theorem for reducing bias in es- 
timated effects in situations similar to some applied settings in terms of design, 
the number of students n, the strength of correlation of the test scores within 
students, and T. In the examples, we include the OLS estimator to calibrate the 
strength of the selection bias. 



4. Simulation Examples 

In this section we consider a series of three simulation examples, building on 
Examples[l]to 3 in Section[3l to understand the implication of the theorem in real 
applications. The residual errors of the data generating model of each example 
{AS + e), satisfy the conditions of the theorem. For each example, we consider 
alternative values for the "treatment" variables Z to understand applications 
where the theorem will be sufhcient for mixed models to remove bias from 
non-random assignment and where it will not. We also consider different non- 
random assignment mechanisms that relate components of 8 to Z . For reference. 
Table [T] summarizes the examples that we consider in this section in terms of 
the model for student heterogeneity, the treatment assignment mechanism, and 
the configuration of the treatment variables. Additional results from these and 
other examples, including comparisons to fixed effects estimators, are available 



in Lockwood and McCaffrey (|28l ). 



4.1. Example [II continued 

The model for residual errors in this example is the same as it was in Ex- 
ample [I] of Section [3l Data for 1000 students are generated from a model where 
each student has a single student effect that is weighted differentially by each 
measure. The model used to generate the data is: 

Yu = z[^^3 + atS, + e,t 

where Ya is student i's score in time period t for t = 1 . . .T, Zn is a vector 
of independent variables that can depend on the time period and varies with 
different scenarios we consider (described below), di is a random normal variable 
with mean zero and variance one, at is a weight that varies across time, and the 
en are independent random normal variables with mean zero and variance 1 — • 
Thus the variance of the residuals after accounting for treatment effects is equal 
to 1 for all time points. The at vary from 0.7 to 0.9 as t increases from 1 to T, 
so that the correlations of the residuals within students over time are bounded 
between about 0.5 and 0.8, values consistent with real longitudinal achievement 
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Student Heterogeneity Model 


Treatment Assignment 


Treatment Scenarios 


Example [IJ 

Single factor with varying 
weights over time 


Depends on factor 


Four scenarios: 

1: Single Tx effect, students 

remain in same Tx condition 

at all time points 

2: Single Tx effect, students' 

Tx conditions vary across 

time points 

3: Time-varying Tx effects, 
students remain in same Tx 
condition at all time points 
4: Time-varying Tx effects, 
students' Tx conditions vary 
across time points 


Example \2\ 

Two correlated factors with 
weights gradually changing 
from one to the other over 
time 


Three scenarios: 

1: Tx assignment depends 

equally on both factors 

2: Tx assignment depends on 

only first factor 

3: Tx assignment depends on 

only second factor 


Treatment only in final year 


Example \S\ (Teacher Ef- 
fects) 

Linear growth model with 
positively correlated random 
slopes and intercepts 


Assignment to teachers de- 
pends on both intercepts and 
slopes 


Consider one to four tested 
subjects per year and four dif- 
ferent models for teacher ef- 
fect persistence 



Table 1 

Summaries of student heterogeneity model, treatment assignment mechanism, and treatment 
exposure scenarios for the simulation examples. 



data. If the at were constant then the model would be the standard student 
effect model, so this example is a slight deviation from that case. 

To explore how Z affects the implications of the theorem, we consider four 
scenarios for the Za. All four involve dichotomous treatment indicators where 
the log odds of treatment assignment equal Si - that is, student assignment 
to treatment is non-random. In scenario 1, a student's treatment assignment 
remains the same for all values of t and there is a single treatment effect that is 
constant for all values of t. In scenario 2, a student's treatment assignment can 
vary across time periods, but there is again only a single treatment effect that 
is constant for all values of t. In scenario 3, a student's treatment assignment 
remains the same for all values of t, but the mean and the treatment effect vary 
with t. In scenario 4, a student's treatment assignment can vary across time 
periods and the mean and the treatment effect vary with t. For each scenario, we 
consider estimators for T =3, 5, 10, 15, and 20 with 100 Monte Carlo iterations 
run at each value of T. 

Figure [1] presents a standardized measure of absolute bias for each estimator 
(OLS and mixed model) as a function of the scenario and the number of scores. 
For the mixed model estimator, we use the known value of R; we consider the 
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Fig 1. ( ExamvleU\l Average absolute standardized bias as a function ofT. Different scenarios 
are shown in different frames. Each frame contains the average absolute standardized bias for 
OLS (plus signs) and mixed model (open circles). 



behavior of the estimator when R also must be estimated in Example [2] The 
standardized bias measure for a given estimator, scenario and number of scores 
is calculated as follows. Each Monte Carlo simulation provides estimates of the 
treatment effects (of which there are T for scenarios 3 and 4), the absolute 
values of which arc then divided by the marginal standard deviation in the 
scores. This makes the metric intcrpretable in terms of standardized effect sizes. 
For scenarios 3 and 4, these standardized quantities arc then averaged over the 
multiple treatment effects of these scenarios. Finally the resulting quantities are 
averaged across Monte Carlo simulations. 
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As shown in the figure, the mixed model estimator greatly reduces the bias 
relative to OLS in scenarios 2 and 4 in which treatment assignment varies over 
time. The bias decreases as T increases in accordance with the theorem. The 
result holds in scenario 4 even though the number of treatment effects is growing 
with T, indicating that the bias compression is acting effectively across the en- 
tire suite of treatment effects. However, when treatment assignment is constant 
(scenarios 1 and 3; top row of Figure [1]) , the mixed model estimators are biased 
and the bias does not decrease with T. In these scenarios, the mixed model es- 
timator is similar to the OLS estimator and performs about the same. The row 
sums of the absolute values of the elements of the matrix {Z' Z)~^ Z' ap- 
pear to grow with T when the treatment assignment is constant but not when 
it is not. The reason for the difference between the estimators is that when 
treatment assignment is constant each student's contribution to Z' Rr^ Z is not 
growing in T because of cancellation between the negative and positive elements 
of . When treatment assignment varies, cancellation does not occur and the 
elements of Z' Z grow with T. This behavior alternatively can be viewed in 
terms of available information to estimate 5i - only when treatment assignment 
varies can the data provide the information about 5i necessary to correct for 
selection. We conjecture that as long as both the at and the overall fraction of 
non-treatment scores remain bounded away from zero as T cx), the bounded 
rowsums condition will be met and the estimator will be consistent. 

4-2. Example\^ continued 

This example uses the model for residual errors introduced in Example [2] 
of Section [3] in which 5 has two components and the contributions of these 
components to scores gradually changes across tests from one factor to the 
other. Such a situation might occur if math tests measure two constructs such 
as computation and problem solving, but the weight given to the two constructs 
varies systematically across tests. For example, tests for elementary students 
might focus more on computations whereas tests for middle school students 
might focus more on problem solving. 

The Z used in this model was motivated by a case that is common is ap- 
plications where T — 1 test scores are collected on students prior to treatment 
and then a fraction of the students receive treatment with treatment assign- 
ment depending on unobserved characteristics of the student. It is clear that 
this Z will meet the sufficient condition on the {Z'RZ)~^ Z' and we showed in 
Section [3] that the assumptions of the theorem are met. Hence, for large T, the 
treatment effect estimated from the mixed models estimator should be approx- 
imately unbiased. The motivation for this example is to explore the properties 
of the estimator when T is small, to explore the impact of the treatment as- 
signment mechanism on bias, and to explore the impact of estimating R on the 
performance of the mixed model estimator. 

The score Yit for student i on test t = 1, . . . T is generated according to the 
model 

Yit = /Ut + ZitP + aitSii + a2tSt2 + e^. 
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The are the mean scores for ah students by year, and (3 is the effect of 
treatment and is the parameter of interest. The variable za is a treatment 
assignment indicator, and because aU but the final test are pre-treatment it 
is identically equal to zero for all t < T. For the roughly one half of the students 
who receive treatment z^t = 1 and for all other students it is zero. The weights 
ait are an evenly spaced sequence of values from .1 to .9 and the weights a2t are 
an evenly spaced sequence of values from .9 to .1. The Si are bivariate normal 
random variables, with mean zero, variance one and correlation .5. The are 
i.i.d. iV(0, .2) variables. These values imply that correlations among observations 
from the same student vary from about 0.5 to 0.8 with mean around 0.75. 

Using this model, for a sequence of Ts ranging from 2 to 20, we generated 
100 Monte Carlo samples of 1,000 students independently under three scenarios 
for treatment assignment. In the first scenario, treatment assignment depends 
equally on dn and di2 with the log of the odds of treatment equal to Adu + A5i2. 
In the second scenario, treatment assignment depends only on Sn with the log of 
the odds of treatment equal to .45^1 and in the third scenario, treatment assign- 
ment depends only on 5i2 with the log of the odds of treatment equal to A5i2- 
For each scenario, we consider the bias in the estimated the treatment effect 
using OLS and the mixed model estimator using both the known R as well as 
R estimated from the data using restricted maximum likelihood implemented in 
SAS PROC MIXED. We also repeated the estimation using 5,000 observations. 
The results are extremely similar to the cases with 1,000 and are not presented. 



Assignment on 5i& 62 
n = 1000 



+■+ + + +■+■+■ 



Assignment on 5i 
n.1000 



Assignment on 82 
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Fig 2. (Example 0) Standardized absolute bias as a function of T. Scenarios 1 to 3 are 
presented in the panel going from left to right. Each frame contains the average absolute 
standardized bias for OLS (plus signs), mixed model with known R (open circles), and mixed 
model with estimated R (X's). 



Figure [H gives the absolute bias in the estimated treatment effect standard- 
ized by the marginal standard deviation in scores for the last period for the 
three estimators. For all three scenarios, students assigned to treatment have 
significantly higher S values and the OLS estimator has substantial bias, es- 
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pecially in scenarios 1 and 2. For scenario 1, in which treatment assignment 
depends equaUy on both elements of 5, the mixed model estimator has rapidly 
decreasing bias as a function of T and the bias is near zero for T greater than 10. 
In scenario 2, treatment assignment only depends on 5i, and again the mixed 
model estimator has bias approaching zero as T grows, but it declines more 
slowly than scenario 1 because more recent tests more heavily weight 5i . In con- 
trast, treatment assignment in scenario 3 depends only on S2 which has greater 
weight in the early scores than the later scores. The OLS estimator using only 
the test score from the final period in which 62 is severely downweighted leads 
the bias in OLS to be smaller than in the other two scenarios. Similarly, the 
small weights on 62 in final period result in very low bias for the mixed effects 
estimator for all values of T. 

In all three scenarios and at all values of T, the performance of the mixed 
model estimator with known R is very similar to the performance of the esti- 
mator with R estimated from the data. In scenarios 1 and 2, there is a trend 
for performance of the estimator with known R to improve relative to the es- 
timator with an estimated R at larger values of T but the differences between 
the estimators are always less than an percentage point or two. The difference 
for larger values of T might be due to imprecision in estimating such a large 
covariance matrix but the similarity at T = 20 is very encouraging because the 
covariance matrix has 210 parameters, but even with as few as 1000 students it 
estimated with sufficient precision to effectively mitigate the bias. 

4-3. ExamplelS^ continued: Teacher Effect Estimation 

One of the primary motivating examples for this article is how best to model 
student heterogeneity when estimating individual teacher effects, as policy inter- 
ests in using such estimates to reward or sanction teachers is intensifying. The 
estimation of teacher effects using multivariate longitudinal models is extremely 
difficult to treat analytically. In the simplest case without any other covariates 
in the model, the columns of Z correspond to effects of individual teachers on a 
particular subject, and the elements of the matrix link each student test score 
to the current and past teachers effects that may have contributed to it. The 
complexity of Z arises from the crossing of students with teachers as students 
progress through grades, the assumed model for the persistence of past teacher 
effects into future test administrations, and the fact that the number of columns 
of Z grows as more grades and/or subjects are considered. Thus we examine 
how the mixed model approach behaves through a sequence of simulations. 

We consider n = 1000 students followed for a period of G consecutive grades, 
with achievement measured once per grade on S different academic subjects 
(e.g. reading, mathematics, science, etc). Thus students have a total of T = GS 
scores. The students are assigned to teachers in each grade, and those teachers 
are linked to all of the subject scores in each grade. We assume that each class 
has 25 students, so in this case there are n/25 = 40 teachers per grade and 
thus 40G teachers in total across all grades. A separate effect is estimated for 
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each teacher for each subject, so that the total number of teacher effects being 
estimated is 40(75. 

To simphfy the evaluation of the simulation, we generate the data in such 
a way that there are truly no teacher effects. The score Yisg for student i on 
subject s in grade g (indexed from to (G — 1)) is generated according to the 
random growth model 

Yisg = (5j + + (Aj + \s)g + e.isg 

similar to that presented in Example [3] in Section [31 except it applies to multiple 
subjects and the trajectories from different subjects arc correlated through the 
shared intercept parameter 5i and shared slope parameter A^. We assume the 
following independent distributions for the parameters, where all parameters 
are also independent across students: 

((5i,Ai) ~ iV(0,S) with Ell = cr|,E22 = 0-^,^21 = = mao-A 
(5,1... ^.5- iid N{Q,vj) 
X,i...Ks- iid iV(0,i/|) 

e,sg^ iid iV(0,(72) (11) 

The only parameters that are allowed to be correlated are the common intercept 
and slope parameters, which have correlation r. Under this model, the marginal 
variance of the scores is the same for all subjects in a given grade, and is (ct| + 
Vg) + g'^{(j\ + y\) + 2gras(J\ + (J^ for g = 0, . . . , (G — 1). In general scores are 
correlated across both subjects and grades, with covariances determined by the 
parameters and the time lags. 

Students are regrouped into classes each grade. We introduce spurious teacher 
effects by making these assignments non-random, and in particular making as- 
signments dependent on the parameters 5i and A;. For each student in each 
grade, we calculate the quantity 

r,,g = 0.3iS,/as) + 0.3{K/<^x) + OA^.g 

where are independent standard normal variables. For each grade, we assign 
the smallest 25 rjig to class 1, the next smallest 25 rjig to class 2, etc, all the way 
to the largest 25 r]ig to class 40. This results in selection into classrooms that is 
moderately strong on both student intercepts and student slopes. 

In order to estimate teacher effects from these data (the true values of which 
are identically zero), we need to make assumptions about the design matrix Z 
that links teacher effect indicators given by the 40GS' columns to sequences of 
test scores for students. We assume in all cases that the teacher effect for a 
given subject affects only the test scores (potentially current and future) for 
that same subject. For each subject, we assume that the effect of a teacher 
experienced in grade gi persists into grade g2 > gi by the amount a^^~^^ for 
some < a < 1. The parameter a is taken to be the same for all subjects. 
The case a ~ corresponds to no persistence of past teacher effects, the case 
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a = 1 corresponds to complete persistence of past teacher effects (i.e. the as- 
sumption made by tlie Tennessee Value Added Assessment System (TVAAS) 
"layered model" of Sanders et al. (fill )'), and < a < 1 corresponds to decaying 
persistence depending on the lag. The design matrices Zi are fully determined 
given the sequence of class assignments and a. 

Our simulation uses G = 5, and then varies the number of subjects per grade 
from 1 to 4 and also considers values of a of 0, 0.3, 0.7 and 1, for a total of 16 
design points. We use a common set of variance components and selection model 
parameters across all design points. In particular we set = 0.5, a\ = 0.125, 
r = 0.3, = 0.2, i/| = 0.05, = 0.8, which leads to R with marginal 
variances of 1.500, 1.825, 2.500, 3.525, and 4.900 by grade for each subject, and 
correlations (both within and across subjects) ranging from about 0.3 to 0.8 
with an average of 0.5. 

For each design point, we independently generate 100 datasets, and for each 
dataset we consider three different estimators for the teachers effects: completely 
unadjusted classroom means, the OLS estimator, and the mixed model estima- 
tor using the known value of R. The unadjusted classroom means are provided 
as a strawman to calibrate the strength of selection of students to teachers, like 
the OLS estimators in the other Examples. For each estimator, rather than sum- 
marize the bias in each individual teacher effect, we report estimated variance 
components of teachers by grade, expressed as a percentage of the corresponding 
marginal variance in that grade. This standardized measure of estimated teacher 
variability is commonly used in the literature to summarize the heterogeneity 
of teacher effects on the scale of student measurements (HI; [25; 33; H; H). Be- 
cause our data generating model contains no true teacher effects, the correct 
value is zero and percentages close to zero indicate that an estimator is behav- 
ing effectively. Because of the simplified balanced design of our simulations, the 
behavior of the estimators is exchangeable across subjects for a given total num- 
ber of subjects and thus we average the estimated variance components across 
subjects within grade. 

The results are summarized in Figure [3l Each frame of each plot corresponds 
to a different value of a and presents the estimated teacher variance components 
by grade and for numbers of subjects of 1, 2, 3 and 4. The lines connect the 
estimates for a given number of subjects, with the dotted lines corresponding 
to unadjusted means, dotdash lines corresponding to OLS, and solid lines cor- 
responding to the mixed model estimator. The bias in the unadjusted means 
and OLS estimators are (up to Monte Carlo error) invariant to the number of 
subjects, while the bias in the mixed model estimator decreases as the number 
of subjects increases. As such, the lowest mixed model trajectory in each frame 
corresponds to 4 subjects, and the highest corresponds to 1 subject. 

The unadjusted means indicate that the spurious variation among teachers 
increases across grades, which makes sense because students are selected into 
classrooms partially on the basis of growth. When a = 0, OLS is equivalent 
to the unadjusted means but diverges from it as a increases. The decreasing 
bias in OLS as a increases for grades after the first grade is because the OLS 
estimator approaches a first-differenced (gain score) estimator as a goes to one. 
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Fig 3. (Examvle [gl teacher effects). Estimated teacher variance components, expressed as a 
fraction of the marginal variance by grade, for unadjusted classroom means (plus signs), OLS 
(X's) and mixed model (open circles). Each frame of the plot corresponds to a different value 
of a. Multiple lines of the same type within frame correspond to numbers of subjects of 1, 2, 
3 and 4 as described in the text. 



This would be sufficient for OLS to remove all bias for teachers beyond the first 
grade if students were selected into classes on the basis of intercepts alone; how- 
ever the selection on growth ensures that OLS remains biased. Alternatively, 
the mixed model estimator is generally effective at removing bias, particularly 
when a is small, when the bias is uniformly small and effectively zero when mul- 
tiple subjects are available. As a grows the bias for the mixed model estimator 
remains for first grade teachers, and when a — 1, the mixed model estimator 
is inconsistent. This is analogous to the cases considered in Example [1] where 
treatment assignment was constant for all time points, because when a — 1 
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the first grade teacher effect is similarly constant for all time points. Thus in 
value added studies using mixed model estimators, it is customary not to report 
estimated teacher effects from the first grade of available teacher links (@; [soh 

5. Discussion 

The primary criticism of random effects or mixed model approaches to an- 
alyzing longitudinal data is that they will provide inconsistent estimates when 
unobserved individual effects are correlated with other variables in the model. 
Under the standard unobserved effects model, a fixed effects approach does not 
suffer these same shortcomings and so is widely cited as the preferable analy- 
sis technique. However, the results of the econometrics literature indicate that 
under the standard unobserved effects model, fixed and random effects estima- 
tors are very similar when individual heterogeneity accounts for much of the 
unexplained variation in outcomes and when many measurements are available 
for each individual. That is, although the random effects estimator can be in- 
consistent, the magnitude of its bias in practical applications is not necessarily 
large and the resulting estimates not necessarily poor. The main contribution 
of this paper is to demonstrate that this "bias compression" property of the 
mixed model approach extends beyond the standard unobserved effects model 
to a more general model that might be appropriate for the kinds of complex 
outcomes considered in social science research, and in particular may be ap- 
propriate for analyses of longitudinal student achievement data. This result, 
in conjunction with the well-established results about the circumstances under 
which mixed model estimators can be optimally efficient (Aitken's Theorem; see, 
e.g., Theil (j47l )). suggests that the mixed model approach may have benefits in 
longitudinal data analyses that are not widely appreciated. 

Intuitively, the mixed model estimator can mitigate bias because pre-multi- 
plication of measurements by serves as a type of regression adjustment. 



By the results for inverting partitioned matrices (|44[ ). pre- multiplication of the 
measurement vector y by results in values in which each score yu is re- 
placed by a scaled version of the residual yu — yu where ijit is based on the 
regression of yu on all of the other available measurements for the individual. 
This adjustment in some sense generalizes the adjustment made by the fixed 
effects estimator, which replaces each score by its deviation from the average 
score for each student. Differencing removes all bias due to individual hetero- 
geneity under the standard unobserved effects model, but generally would not 
be appropriate under the model in Equation[9l Under this general model, as long 
as individual heterogeneity can be adequately represented by a low-dimensional 
factor, and the signal about that heterogeneity is not swamped by the noise in 
the measured outcomes, then the residuals from regression adjustment can be 
approximately unrelated to the unobserved individual effects, and estimates of 
model variables can be approximately unbiased. The degree of bias compression 
improves as both the number of available outcomes, and the signal to noise ratio 
of those outcomes, increases. 
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Importantly, our simulation examples demonstrate that the bias compression 
can hold simultaneously over a suite of parameters whose dimension may be 
growing as the number of measurement grows, such as with the treatment- by- 
time interactions considered in Example [T] and the individual teacher effects 
considered in Example [H Allowing for this kind of generality in the models may 
be important in the very circumstances in which the mixed model approach 
might be warranted - when there is concern that the sequence of tests are not 
measuring the same construct in the same way over time. It also makes the 
mixed model approach particularly relevant to jointly modeling measurements 
on different tested subjects. Nearly all longitudinal achievement data series con- 
tain test outcomes from multiple subjects in each year, commonly including both 
math and reading, often science and/or social studies, and sometimes even mul- 
tiple scores for the same subject taken from different assessments. It is common 
for analyses of educational variables, particularly those using fixed effects ap- 
proaches, for each subject to be modeled independently and estimates reported 
separately for each subject (c.f. ( l^; 3^; [H)). Analysis of the parallel data se- 
ries from different subjects in a SUR framework (|5ll ) is not common practice in 
educational research. Our analytical and simulation results suggest that jointly 
modeling scores from multiple subjects increases the information available for 
controlling for individual heterogeneity, which can lead to more effective bias 
compression for all estimated treatment-by-subject effects simultaneously. That 
is, while the usual justification for SUR analysis is increased efficiency, our re- 
sults indicate that SUR analysis in the mixed model framework may also provide 
important bias reduction benefits. The benefits of exploiting the redundancy in 
repeated measures of multiple outcomes is also noted by Thum jish. Jointly 
modeling multiple subjects also helps to make the larger values of T considered 
in our simulation examples more tenable. While following students for T = 20 
years is unrealistic, using repeated measures on a vector of annual measurements 
from different subjects or different tests of the same subject makes achieving 
large numbers of test scores feasible, particularly with increased scope and fre- 
quency of standardized testing and the rapidly improving capabilities for linking 
these scores to students over time. 

The ability of the mixed model approach to perform well simultaneously 
for a number of parameters that grows as the number of test scores grows is 
particularly relevant for estimating individual teacher effects. For example, the 
TVAAS model l[4l] ) as applied in Tennessee and elsewhere estimates separate 
teacher- by-subject effects, analogous to Example [3] above, using up to 25 scores 
for individual students (five subjects for five years). William Sanders, the devel- 
oper of the TVAAS model, has claimed that jointly modeling 25 scores for stu- 
dents, along with other features of the TVAAS approach, is extremely effective 
at purging student heterogeneity bias from estimated teacher effects (personal 
communication). The analytical and simulation results presented here largely 
support that claim. It is worth noting, however, that TVAAS and related meth- 
ods of McCaffrey et al. (j33l ) and Raudenbush and Bryk (jSTI ) have the additional 
complexity of modeling individual teacher effects as random effects rather than 
unknown parameters of indicator variables in a regression model. This makes 
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the analytical considerations more difficult, but as discussed in Lockwood and 
McCaffrey (|28l ). we believe that the essential elements of the bias compression 
property of the mixed models discussed here carry over to this more complex 
class of models. 

One critical caveat of our findings is that there are circumstances in which 
the mixed model approach can fail. Our simulation examples showed cases in 
which the mixed model approach is unable to remove bias, even with increasing 
amounts of test score data on individual students. All of these circumstances 
occurred when student treatment status did not vary over time. This is analo- 
gous to the problems faced by fixed effects analyses with similar data, because 
the treatment variable is confounded with the average of the unobserved stu- 
dent effects for treated students. Intuitively it makes sense that using repeated 
measures on students to control for student heterogeneity generally is not going 
to be effective under these circumstances because there is no within-student 
information upon which to gauge an individual student's performance in the 
absence of treatment. 

A related circumstance when random effects and mixed models cannot con- 
trol for bias due to unobserved student heterogeneity is when the population is 
stratified as described in McCaffrey et al. ((ssh. A stratified population is one in 
which there are disjoint groups of students such that students within a group 
share teachers but students in different groups never share any teachers. For 
examples, students from different school districts where there is no interdistrict 
transfer are a stratified population. As discussed in McCaffrey et al. (fssh . dif- 
ferences in strata means in 5 are not removed by pre- multiplication by R~^. 
This is because the teacher effects from different strata form "treatments" that 
are constant for students across the entire time period of the data and, as with 
constant treatment assignments, random effects cannot mitigate bias. 

Another potential limitation of our results is that our analytical results, and 
all of our examples other than Example [H treated R as known (using the in- 
feasible GLS estimator), whereas R must be estimated from the data in nearly 
all practical settings. When T scores are being modeled, R has T{T + l)/2 un- 
known parameters if no additional parametric structure is imposed, and unless 
the number of students is large relative to T it is likely that R may be estimated 
with substantial error. Moreover there are cases when R may not even be esti- 
mated consistently, such as when students do not switch treatment status over 
time as in Example [1] Our results from Example [2] where we compared the bias 
compression from the mixed model estimator using both the known R and R 
estimated from the data warrant cautious optimism, at least about the effect of 
estimation error in R. In that case the relative abilities of these two estimators 
to compress bias were almost identical, even for T = 20, where our estimated 
R had 210 parameters and only 1000 students were used in the simulation. 
We obtained similar results in auxiliary simulations to Example [1] (not shown) 
where we estimated the model parameters, including i?, using a Bayesian model. 
We conjecture that for the Rs that are likely to exist in longitudinal student 
achievement data series - where a large portion of the residual variance is dom- 
inated by a low-dimensional student-specific factor - that estimation error in R 
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is not likely to substantially degrade the bias compression property of the mixed 
model estimator because the redundant information available in the vector of 
scores is likely to lead to estimates of yu that are largely insensitive to error 
in the estimated coefficients of the regression adjustment performed by . 
Further exploration of this issue in settings more complex than that considered 
in Example [2] is an important area for future research. 

Also, all of the cases considered here had balanced, completely observed score 
data for all students. Actual data sets invariably contain a substantial amount of 
missing test score information, and when multiple cohorts are being modeled, it 
is common for different cohorts to have different configurations of available test 
scores. To explore the effects of missing data, we expanded Example[2]by adding 
cases where 50 percent of the data are missing at random. The bias continues 
converge to zero with increasing T but the decay is reduced so that the bias 
with T = 20 and 50 percent of observations missing is similar to the bias with 
T = 15 and all the data are observed. This suggests that our general findings 
about the bias compression of the mixed models approach are not invalidated 
by the complexities of missing data, but it is likely that incompleteness in the 
test score data will in general degrade the bias compression to some extent. On 
the other hand, the mixed models approach makes use of all of the information 
available for each student in estimating the unknown parameters - in essence 
estimating yu from the regression on the available scores for each student - and 
so might lead to particular efficiency gains relative to other approaches when 
missing data are substantial. 

In summary, our results indicate that the fact that mixed model approaches 
to longitudinal data analysis can lead to inconsistent estimates does not mean 
that the estimates are necessarily poor in any given instance. When applied 
to longitudinal data involving a large number of correlated measurements on 
individuals, they can provide nearly unbiased estimates even under relatively 
complex heterogeneity models involving multiple, unobserved individual-specific 
attributes whose relationship to the observed measurements varies across those 
measurements. Importantly, the bias compression is a byproduct of GLS esti- 
mation and does not require specification of Ri - for example, it is not nec- 
essary to decide if a one-factor or two-factor heterogeneity model is more ap- 
propriate. This kind of "black-box" robustness of the mixed model approach 
might be beneficial in circumstances such as longitudinal achievement data se- 
ries where the true heterogeneity model may be complex and obtaining formally 
consistent estimates may require more advanced methods. For example, Ahn, 
Lee and Schmidt (0) and Han, Orea and Schmidt (flsh develop and compare 
consistent generalized method of moments and concentrated least squares es- 
timators under structural models similar to our Example [H where the coef- 
ficients on the individual-specific factors vary across measurements. Similarly, 
covariance structure modeling approaches (0; [13; 13) are well-suited to han- 
dling cases where factors are related to measurements in time-varying ways. 
Understanding the performance of the mixed model approach in terms of both 
bias compression and precision relative to these alternatives warrants further 
study. 
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It also important that future work study the performance of the mixed model 
approach relative to empirical modificiations to the standard fixed effects model 
that attempt to address some of the complexities discussed in this article. For 
example, with longitudinal achievement data, analysts often internally stan- 
dardize each test measure to have constant variance of one or replace scores by 
ranks or transformed ranks prior to analysis as a strategy to increase the plau- 
sibility of the standard unobserved effects model when there are concerns about 
the comparability of scales across measurements (|l4l : Il5t 1521 ) . This procedure is 
appropriate under particular instances of the model in Equation [9] but not in 
general. Thus standard fixed effects approaches applied to the transformed data 
arc not likely to provide formally consistent estimates, and it would be useful 
to understand how the degree of inconsistency compares to the mixed model 
approach. Similarly, it is a common strategy for analysts to use fixed effects 
on student timepoint-to-timepoint gain scores rather than student level scores 
when there are concerns about selection bias due to differential grade-to-grade 
growth in achievement (13; IS)- This approach has strict data requirements be- 
cause multiple year-to-year gain measures must be available on a student in 
order for that student to contribute to the estimation of the model parameters, 
which can result in substantial reduction in the amount of usable information 
given the degree of missing data commonly found. The mixed model estimator 
does not impose similar restrictions, and so might have lower mean squared 
error even though it may be inconsistent. 



6. Appendix A 



Theorem. Let Ai and be defined as in Section\^ Then sufficient conditions 
for the elements of R^^A to go to zero uniformly as T oo are: 

1. The smallest eigenvalue of A'{^^^ Ai goes to infinity as T oo; and 

2. There exists a number C independent of T such that the elements an of 

the symmetric square root of satisfy X]t=i I'***! ^ '^^^ * 

Proof. Throughout, all matrices except Si and its root are assumed to depend 
on T so wc suppress that notation. Because R^ A = (/„ ® R^^){In (S) Ai) = 
In'^ Ri^Ai, it is sufhcient to consider only the elements of R^^Ai. Because all 
matrices would thus be subscripted by "1" , we suppress that notation as well 
and use, for example, R for Ri, A for Ai, 'S' for ^i, etc. We also assume that 
all matrix roots are symmetric roots. 



Recall that R = ASA' + By the Schur complement formula (|44f ) 



R 



/ - AS^/^il + S^^'^A'^-^AS^^'^y^S^^'^A'-^-^ 



so that 
R ^A -- 
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Let X = ^-i/a^^i/z of dimension (T x d). Then 

R^A = ^-'^'^x [/-(/ + x'xy^x'x] s-^/^ 

Singular value decompose X as C/A^/^ V where t/ is (T x d) with orthonor- 
mal columns, A^/^ = diag{y/Xl, . . . ,^fXd), and V is (d x d) and orthogonal. 
Because A is assumed to have full column rank, X has full column rank and 
so V\m > for m = 1, . . . ,d. Note that X' X = S^I^A'^-^AS^'"^ = VKV 
where Ai, . . . , are the eigenvalues of X' X. 

Now consider the matrix 

[/-(/ + x'xy^x'x] = [/-(/ + vKV'y^vW] 

= [VV - V{I + K)-^V'VAV'] 
= [V{I-{I + A)-^A)V'] 
1 1 



Vdiag{ 



1 + Ai 



Thus 



Vdiag{ 



l + Ai 



■ ' 1 + Ad 



)V' 



1 + Ad 



V 



g-1,2 



where A* is diag{^/\l/{l + Ai), . . . , VAd/(l + Ad)). 

An arbitrary element r-y of ii^^A is the inner product of a row a, of yp^^/^ 
and a column bj of C/A*VS^~^/^. Let s be the absolute value of the largest 
element of S^^^"^. By condition 1 of the theorem and Lemma ??, all the eigen- 
values of X' X are getting arbitrarily large for sufficiently large values of T, so 
that for any e > there exists a such that -\/Am/(l + Am) < €/Csd^ for all 
m = 1, . . . ,(i and for all T > T^. Because V is an orthogonal matrix and the 
columns of U are orthonormal. their elements cannot exceed 1 in absolute value, 
and so the absolute value of the largest element of UA*VS~^^^ is bounded by 
e/C for all T > T,. Then, for any i and j and for all T > T„ 



where the last inequality follows by condition 2 of the theorem. □ 

Lemma 1. Let B be a positive definite d x d matrix and let B^/^ be a sym- 
metric root of B . Let Mt be a sequence of d x d matrices. Let At denote the 
smallest eigenvalue of Mt o,nd lot denote the smallest eigenvalue of Qt = 
B^/^MtB^^^ . Then Xt ^ oo as T ^ oo if and only if lot ^ oo as T ^ oo. 
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Proof. Suppose At ^ oo as T ^ oo. Let "(/'mm > denote the smallest eigen- 
value of B. Then for every d-vector x such that x'x = 1 

X Qtx ~ (x Bx)- 



x'Bx 
x'Qtx 



min t T-> 

x'Bx 



u'Mtu 1 

V'mm ; , where u = B ' x 

u'u 



Because B is positive definite, division by x'Bx is well defined. By assumption 
tpmin^T ~^ OO with T SO that ujt, the minimum of x'Qtx, must converge to 
infinity as well. 

Now suppose that lot — > oo as T cx). Let ipmax > denote the largest 
eigenvalue of B. Then for any d-vector a, a'Ba < ijjmaxd'o,- Now, let it be a 
vector such that At — {u'Mtu)/u'u and let x = B^^^^u so that B^^^x = u. 
Then 

x'B^I^MtB^/^x 

At — 



.1 /;./■ 

x'B^/^MtB^/^x 

'1 



> 



> > OO as J — > oo. 



□ 
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